Differential Expression Analysis

Differential Gene Expression Analysis Workflow


General idea behind RNAseq data analysis

General idea behind any statistical test

Normalisation

  • Read counting estimates the relative abundance of RNA from each gene

  • Does this accurately represent the original population of RNAs?

  • The relationship between counts and RNA expression is not the same for all genes across all samples

Library Size

Differing sequencing depth

Gene properties

Length, GC content, sequence

Library composition

Quantification is relative - changes in relative abundance for one gene will affect the relative abundances of other genes

“Composition Bias”

General principle behind normalisation

  • Normalization has two steps
    • Scaling
      • First get size factors or normalization factors
      • Usually one size factor per sample
      • Scale the counts by dividing the raw counts of each sample by its sample-specific size factor
    • Transformation: transform the data after scaling
      • Per million
      • log2
      • Square root transformation
      • Pearson residuals (e.g. sctransform)
  • Normalization aims to remove technical variation while preserving biological variation
  • Normalization helps in making two samples comparable
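
The two steps above can be sketched in a few lines. This is a minimal illustration only: the gene names and counts are hypothetical, the size factor is simply the library size, and the pseudocount of 1 before log2 is an assumption rather than a recommendation.

```python
import math

# Hypothetical toy counts: one entry per gene, one value per sample.
counts = {
    "geneA": [100, 200],
    "geneB": [300, 600],
    "geneC": [600, 1200],
}
n_samples = 2

# Step 1 (scaling): one size factor per sample, here simply the library size.
library_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]

# Divide raw counts by the size factor and express them per million reads (CPM).
cpm = {
    gene: [row[j] / library_sizes[j] * 1e6 for j in range(n_samples)]
    for gene, row in counts.items()
}

# Step 2 (transformation): log2 with a pseudocount of 1 to avoid log2(0).
log_cpm = {gene: [math.log2(v + 1) for v in row] for gene, row in cpm.items()}
```

In this toy data the second sample was sequenced twice as deeply, so the raw counts differ by a factor of two while the scaled values agree between samples.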

Normalization by library scaling

Library Size Scaling

  • Normalise each sample by total number of reads sequenced.

  • Can also use another statistic similar to the total count, e.g. the median or upper quartile

  • Does not account for composition bias
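
A small sketch of why total-count scaling does not account for composition bias. The samples and counts are hypothetical: four genes have identical raw counts in both samples and a fifth is strongly up-regulated in sample B.

```python
# Hypothetical counts: g1..g4 are unchanged, g5 is strongly up in sample B.
sample_a = {"g1": 100, "g2": 100, "g3": 100, "g4": 100, "g5": 100}
sample_b = {"g1": 100, "g2": 100, "g3": 100, "g4": 100, "g5": 600}


def scale_by_total(sample):
    """Divide each count by the sample's total count (library size)."""
    total = sum(sample.values())
    return {gene: count / total for gene, count in sample.items()}


scaled_a = scale_by_total(sample_a)
scaled_b = scale_by_total(sample_b)

# Ratio B/A for an unchanged gene after scaling.
ratio_g1 = scaled_b["g1"] / scaled_a["g1"]
```

Here `ratio_g1` comes out as 0.5: the single highly expressed gene doubles the library size of sample B, so every unchanged gene appears to be halved.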


DESeq2 analysis workflow


DESeq2 Normalisation

  1. The geometric mean is calculated for each gene across all samples.
  2. The counts for a gene in each sample are then divided by this mean.
  3. The median of these ratios in a sample is the size factor (normalization factor) for that sample.

  • DESeq2 normalization corrects for library size and RNA composition bias
  • Composition bias arises, for example, when a small number of genes are very highly expressed in one sample but not in the other.

Differential Expression

Simple difference in means

Replication introduces variation

Differential Expression - Modelling population distributions

  • Normal (Gaussian) Distribution - t-test

  • Two parameters - \(mean\) and \(sd\) (\(sd^2 = variance\))

  • Suitable for microarray data but not for RNAseq data

Differential Expression - Modelling population distributions

  • Count data - Poisson distribution

  • One parameter - \(mean\) \((\mu)\)

  • \(variance\) = \(mean\)

Differential Expression - Modelling population distributions

  • Use the Negative Binomial distribution

  • In the NB distribution the \(variance\) is not constrained to equal the \(mean\)

  • Two parameters - \(mean\) \((\mu)\) and \(dispersion\) \((\phi)\)

  • \(dispersion\) describes how \(variance\) changes with \(mean\)
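
A short sketch comparing the two variance models. It assumes the parametrisation commonly used for RNAseq counts, \(variance = \mu + \phi\mu^2\); the function names are hypothetical.

```python
def poisson_variance(mean):
    """Poisson: the variance equals the mean."""
    return mean


def nb_variance(mean, dispersion):
    """Negative Binomial (assumed parametrisation): mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2


# Print mean, Poisson variance and NB variance for a dispersion of 0.1.
for mu in (10, 100, 1000):
    print(mu, poisson_variance(mu), nb_variance(mu, 0.1))
```

With a dispersion of zero the NB variance reduces to the Poisson variance, and the gap between the two grows with the mean.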

Anders, S. & Huber, W. (2010) Genome Biology

Differential Expression - estimating dispersion

  • Estimating the dispersion parameter can be difficult with a small number of samples

  • DESeq2 models the variance as the sum of technical and biological variance

  • Estimate dispersion for each gene

  • ‘Share’ dispersion information between genes to obtain fitted estimate

  • Shrink gene-wise estimates towards the fitted estimates
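
To give a feel for a gene-wise dispersion estimate, here is a simple method-of-moments sketch. It is not the estimator DESeq2 uses, and it does not include the sharing or shrinkage steps; it only rearranges the assumed relationship \(variance = \mu + \phi\mu^2\). The function name and the replicate values are hypothetical.

```python
from statistics import mean, variance


def moment_dispersion(normalised_counts):
    """Method-of-moments dispersion: (variance - mean) / mean^2, floored at 0."""
    m = mean(normalised_counts)
    v = variance(normalised_counts)  # sample variance (n - 1 denominator)
    if m == 0:
        return 0.0
    return max((v - m) / m ** 2, 0.0)


# Hypothetical normalised counts for one gene in three replicates.
replicates = [80.0, 100.0, 120.0]
phi = moment_dispersion(replicates)
```

With only three replicates, changing a single value shifts this estimate substantially, which illustrates why estimating dispersion is difficult with a small number of samples.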

Differential Expression - worrying dispersion plot examples


Bad dispersion plots from: https://github.com/hbctraining/DGE_workshop

DESeq2 results

Differential Expression - linear models

  • Calculate coefficients describing change in gene expression

  • Linear Model \(\rightarrow\) Generalized Linear Model

Linear models

  • A model is a simplified representation of how we think different variables relate to each other.
  • Linear models are the most commonly used in statistical inference.

Generalized Linear Models

  • A linear model assumes the errors (residuals) are normally distributed around the fit line
  • A Generalized Linear Model uses a “link function” to enable the linear model to cope with other distributions, e.g. the Negative Binomial
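
One common way to write such a model for RNAseq counts is shown below. The symbols are assumptions introduced here for illustration: \(K_{ij}\) is the count for gene \(i\) in sample \(j\), \(s_j\) is the size factor of sample \(j\), \(\phi_i\) is the dispersion of gene \(i\), \(x_{jr}\) are the entries of the design matrix and \(\beta_{ir}\) are the coefficients (log2 fold changes) to be estimated.

\[
K_{ij} \sim \mathrm{NB}(\mu_{ij}, \phi_i), \qquad \mu_{ij} = s_j \, q_{ij}, \qquad \log_2 q_{ij} = \sum_r x_{jr} \, \beta_{ir}
\]

The logarithm is the link function: the linear predictor models the log of the expected expression rather than the counts themselves.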

GLM for Differential Expression Analysis

Common Experimental Designs

One factor - three levels

Two factors - two levels each

Two factors - two levels each - Additive Model

Two factors - two levels each - Interaction Model
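
The difference between the additive and interaction models shows up in the design matrix. The sketch below builds both by hand for a hypothetical experiment; the factor names (`genotype`, `treatment`), their levels and the helper `design_row` are all illustrative assumptions. In R formula notation these would correspond to `~ genotype + treatment` and `~ genotype * treatment`.

```python
# Hypothetical samples: two factors with two levels each.
samples = [
    {"genotype": "WT", "treatment": "control"},
    {"genotype": "WT", "treatment": "treated"},
    {"genotype": "KO", "treatment": "control"},
    {"genotype": "KO", "treatment": "treated"},
]


def design_row(sample, interaction=False):
    """Return one row of a treatment-coded design matrix.

    Columns: intercept, genotype == KO, treatment == treated,
    and (optionally) their product as the interaction term.
    """
    ko = 1 if sample["genotype"] == "KO" else 0
    treated = 1 if sample["treatment"] == "treated" else 0
    row = [1, ko, treated]
    if interaction:
        row.append(ko * treated)
    return row


additive = [design_row(s) for s in samples]
with_interaction = [design_row(s, interaction=True) for s in samples]
```

The interaction column is 1 only for samples that are both KO and treated, so its coefficient captures how much the treatment effect differs between genotypes.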

Multiple testing correction

  • At a significance cut-off of pval = 0.05, a gene that is not truly differentially expressed still has a 5% chance of being called significant (a false positive).
  • If we test 20,000 genes for differential expression at pval < 0.05, we would expect about 1,000 genes to pass by chance alone, even if none are truly differentially expressed.
  • If we found 3,000 genes to be differentially expressed in total, roughly one third of them could be false positives!
  • The more genes we test, the more false positives we expect. This is the multiple testing problem.
  • We apply an adjustment to the p-values to account for this: the Benjamini-Hochberg procedure, which controls the false discovery rate (FDR).
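
A minimal sketch of the Benjamini-Hochberg adjustment in plain Python. The function name and the example p-values are hypothetical; in practice a library routine would normally be used.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(pvals)
```

Each raw p-value is multiplied by the number of tests divided by its rank, so the smallest p-values are penalised most; keeping genes with an adjusted value below 0.05 aims for roughly 5% false positives among the genes called significant.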

DESeq2 results

  • baseMean - the mean of the normalized counts across all samples
  • log2FoldChange - log2(B) - log2(A), i.e. the difference between treatments A and B on the log2 scale
  • lfcSE - the standard error of the log2FoldChange
  • stat - the test statistic = log2FoldChange/lfcSE
  • pvalue - the p-value of the Wald test
  • padj - the p-value adjusted for multiple testing (false discovery rate)
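
The relationship between the `log2FoldChange`, `lfcSE`, `stat` and `pvalue` columns can be sketched as follows, assuming the Wald statistic is compared against a standard normal distribution with a two-sided test. The function name and the input values are hypothetical.

```python
import math


def wald_test(log2_fold_change, lfc_se):
    """Return the Wald statistic and its two-sided p-value (standard normal)."""
    stat = log2_fold_change / lfc_se
    # Two-sided p-value: 2 * (1 - Phi(|stat|)), written with the
    # complementary error function for numerical stability.
    pvalue = math.erfc(abs(stat) / math.sqrt(2))
    return stat, pvalue


stat, pvalue = wald_test(1.5, 0.5)
```

A log2 fold change three standard errors away from zero gives a p-value of about 0.0027; the same fold change with a larger standard error would be less significant.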

Summary

  • Normalisation to account for technical variation (noise)
  • Use the Negative Binomial distribution
  • Use a Generalized Linear Model to estimate coefficients
  • Test statistic is log2 Fold Change / Standard Error of log2 Fold Change
  • P value derived from test statistic
  • Multiple testing correction